Parallel tree-projection-based sequence mining algorithms
Authors
Abstract
Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. The enormous size of available datasets and the possibly large number of mined patterns demand efficient, scalable, and parallel algorithms. Even though a number of algorithms have been developed to efficiently parallelize frequent pattern discovery algorithms that are based on the candidate-generation-and-counting framework, the problem of parallelizing the more efficient projection-based algorithms has received relatively little attention, and existing parallel formulations have been targeted only toward shared-memory architectures. The irregular and unstructured nature of the task graph generated by these algorithms, and the fact that these tasks operate on overlapping sub-databases, make it challenging to efficiently parallelize these algorithms on scalable distributed-memory parallel computing architectures. In this paper we present and study a variety of distributed-memory parallel algorithms for a tree-projection-based frequent sequence discovery algorithm that are able to minimize the various overheads associated with load imbalance, database overlap, and interprocessor communication. Our experimental evaluation on a 32-processor IBM SP shows that these algorithms are capable of achieving good speedups, substantially reducing the amount of work required to find sequential patterns in large databases.
Similar resources
Parallel Formulations of Tree-Projection Based Sequence Mining Algorithms
Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient, scalable, and parallel algorithms. Even though a number of algorithms have been developed to efficiently parallelize frequent pattern discovery algorithms that are based on the...
Parallel Formulations of Tree-Projection-Based Sequence Mining Algorithm
Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient, scalable, and parallel algorithms. Even though a number of algorithms have been developed to efficiently parallelize frequent pattern discovery algorithms that are based on the...
Parallel Tree Projection Algorithm for Sequence Mining
Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient and scalable algorithms. In this paper we present two parallel formulations of a serial sequential pattern discovery algorithm based on tree projection that are well suited for dis...
Dynamic Load Balancing Algorithms for Sequence Mining
Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient and scalable algorithms. In this paper we present a parallel formulation of a serial sequential pattern discovery algorithm based on tree projection that uses a novel dynamic load ...
Scalable Data Mining for Rules
Data Mining is the process of automatic extraction of novel, useful, and understandable patterns in very large databases. High-performance scalable and parallel computing is crucial for ensuring system scalability and interactivity as datasets grow inexorably in size and complexity. This thesis deals with both the algorithmic and systems aspects of scalable and parallel data mining algorithms a...
Journal title:
- Parallel Computing
Volume 30 Issue
Pages -
Publication date 2004